Visually Grounded Word Embeddings and Richer Visual Features for Improving Multimodal Neural Machine Translation
نویسندگان
چکیده
In Multimodal Neural Machine Translation (MNMT), a neural model generates a translated sentence that describes an image, given the image itself and one source descriptions in English. This is considered as the multimodal image caption translation task. The images are processed with Convolutional Neural Network (CNN) to extract visual features exploitable by the translation model. So far, the CNNs used are pre-trained on object detection and localization task. We hypothesize that richer architecture, such as dense captioning models, may be more suitable for MNMT and could lead to improved translations. We extend this intuition to the word-embeddings, where we compute both linguistic and visual representation for our corpus vocabulary. We combine and compare different configurations and show state-of-the-art results according to previous work.
منابع مشابه
Incorporating visual features into word embeddings: A bimodal autoencoder-based approach
Multimodal semantic representation is an evolving area of research in natural language processing as well as computer vision. Combining or integrating perceptual information, such as visual features, with linguistic features is recently being actively studied. This paper presents a novel bimodal autoencoder model for multimodal representation learning: the autoencoder learns in order to enhance...
متن کاملImproving Word Sense Disambiguation in Neural Machine Translation with Sense Embeddings
Word sense disambiguation is necessary in translation because different word senses often have different translations. Neural machine translation models learn different senses of words as part of an end-to-end translation task, and their capability to perform word sense disambiguation has so far not been quantified. We exploit the fact that neural translation models can score arbitrary translat...
متن کاملVisually Aligned Word Embeddings for Improving Zero-shot Learning
Zero-shot learning (ZSL) highly depends on a good semantic embedding to connect the seen and unseen classes. Recently, distributed word embeddings (DWE) pre-trained from large text corpus have become a popular choice to draw such a connection. Compared with human defined attributes, DWEs are more scalable and easier to obtain. However, they are designed to reflect semantic similarity rather tha...
متن کاملWMT 2016 Multimodal Translation System Description based on Bidirectional Recurrent Neural Networks with Double-Embeddings
Bidirectional Recurrent Neural Networks (BiRNNs) haveshown outstanding results on sequence-to-sequence learning tasks. This architecture becomes specially interesting for multimodal machine translation task, since BiRNNs can deal with images and text. On most translation systems the same word embedding is fed to both BiRNN units. In this paper, we present several experiments to enhance a baseli...
متن کاملImproving Word Representations via Global Visual Context
Visually grounded semantics is a very important aspect in word representation, largely due to its potential to improve many NLP tasks such as information retrieval, text classification and analysis. We present a new distributed word learning framework which 1) learns word embeddings that better capture the visually grounded semantics by unifying local document context and global visual context,...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1707.01009 شماره
صفحات -
تاریخ انتشار 2017